

Section: New Results

Acoustic-Articulatory Mapping

In this series of studies, we tackle the problem of adapting an acoustic-articulatory inversion model trained on a reference speaker to the voice of another speaker, the source speaker. We rely on the framework of Gaussian mixture regressors (GMR) with missing data. To address speaker adaptation, we previously proposed a general framework called Cascaded-GMR (C-GMR), which decomposes the adaptation process into two consecutive steps: spectral conversion between the source and reference speakers, followed by acoustic-articulatory inversion of the converted spectral trajectories. In particular, we proposed the Integrated C-GMR technique (IC-GMR), in which both steps are tied together in the same probabilistic model. In [34], [43], we extend the C-GMR framework with another model called Joint-GMR (J-GMR). Contrary to the IC-GMR, this model aims at exploiting all potential acoustic-articulatory relationships, including those between the source speaker's acoustics and the reference speaker's articulation. We present the full derivation of the exact Expectation-Maximization (EM) training algorithm for the J-GMR. This algorithm exploits the missing data methodology of machine learning to deal with limited adaptation data. We provide an extensive evaluation of the J-GMR on both synthetic acoustic-articulatory data and the multi-speaker MOCHA EMA database. We compare the performance of the J-GMR to that of other models of the C-GMR framework, notably the IC-GMR, and discuss their respective merits.

We also exploited the IC-GMR framework with visual data to provide visual biofeedback [32]. Visual biofeedback is the process of gaining awareness of physiological functions through the display of visual information. As far as speech is concerned, visual biofeedback usually consists of showing speakers their own articulatory movements, which has proven useful in applications such as speech therapy and second language learning. In this work, we automatically animate an articulatory tongue model from ultrasound images.
We benchmarked several GMR-based techniques on a multi-speaker database. The IC-GMR approach is able (i) to maintain good mapping performance while minimizing the amount of adaptation data (and thus limiting the duration of the enrollment session), and (ii) to generalize to articulatory configurations not seen during enrollment better than the plain GMR approach. These results suggest that the GMR is a good mapping technique for non-linear regression tasks, in particular for those requiring adaptation (using either the J-GMR or the IC-GMR).
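
To make the mapping step concrete, the following is a minimal sketch of plain GMR inference, assuming a joint Gaussian mixture model over stacked input and output vectors has already been trained. It does not implement the C-GMR, IC-GMR or J-GMR models, nor the EM training with missing data described above. The function name `gmr_predict` and all parameter names are hypothetical and chosen for illustration only. The estimate is the usual conditional expectation: each component contributes its own linear regression of the output on the input, weighted by the posterior probability of that component given the input.

```python
import numpy as np


def gmr_predict(x, weights, means, covs, dx):
    """Return E[y | x] under a joint GMM on z = [x; y] (illustrative sketch).

    x:       (dx,) input vector, e.g. one acoustic frame
    weights: (K,) mixture weights
    means:   (K, dx + dy) joint mean vectors
    covs:    (K, dx + dy, dx + dy) joint full covariance matrices
    dx:      dimension of the input part of z
    """
    n_components = len(weights)
    dy = means.shape[1] - dx
    log_resp = np.empty(n_components)
    cond_means = np.empty((n_components, dy))

    for k in range(n_components):
        mu_x, mu_y = means[k, :dx], means[k, dx:]
        s_xx = covs[k, :dx, :dx]   # input covariance block
        s_yx = covs[k, dx:, :dx]   # output-input cross-covariance block
        diff = x - mu_x
        solved = np.linalg.solve(s_xx, diff)  # s_xx^{-1} (x - mu_x)

        # Log of weight times the Gaussian marginal density of x
        _, logdet = np.linalg.slogdet(s_xx)
        log_resp[k] = np.log(weights[k]) - 0.5 * (
            dx * np.log(2.0 * np.pi) + logdet + diff @ solved
        )
        # Per-component linear regression of y on x
        cond_means[k] = mu_y + s_yx @ solved

    # Normalize the posterior component probabilities in a stable way
    log_resp -= log_resp.max()
    resp = np.exp(log_resp)
    resp /= resp.sum()
    return resp @ cond_means


# Usage on a toy one-component model in which y is correlated with x
toy_weights = np.array([1.0])
toy_means = np.array([[0.0, 0.0]])
toy_covs = np.array([[[1.0, 2.0], [2.0, 5.0]]])
estimate = gmr_predict(np.array([1.5]), toy_weights, toy_means, toy_covs, dx=1)
```

In an inversion setting, `x` would hold the acoustic features and the returned vector the estimated articulatory parameters. The adaptation models discussed above extend this scheme with additional variables for the source speaker.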